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ABSTRACT 

We investigate machine learning (ML) techniques for predicting the number of galaxies {Ngai) that 
occupy a halo, given the halo's properties. These types of mappings are crucial for constructing the 
mock galaxy catalogs necessary for analyses of large-scale structure. The ML techniques proposed here 
distinguish themselves from traditional halo occupation distribution (HOD) modeling as they do not 
assume a prescribed relationship between halo properties and Ngai- In addition, our ML approaches are 
only dependent on parent halo properties (like HOD methods), which arc advantageous over subhalo- 
based approaches as identifying subhalos correctly is difficult. We test 2 algorithms: support vector 
machines (SVM) and k-nearest-neighbour (kNN) regression. We take galaxies and halos from the 
Millennium simulation and predict Ngai by training our algorithms on the following 6 halo properties: 
number of particles, M200, <^v, Vmax, half-mass radius and spin. For Millennium, our predicted Ngai 
values have a mean-squared-error (MSE) of ^ 0.16 for both SVM and kNN. Our predictions match 
the overall distribution of halos reasonably well and the galaxy correlation function at large scales to 
^ 5 — 10%. In addition, we demonstrate a feature selection algorithm to isolate the halo parameters 
that are most predictive, a useful technique for understanding the mapping between halo properties 
and Ngai- Lastly, we investigate these ML-based approaches in making mock catalogs for different 
galaxy subpopulations (e.g. blue, red, high Mgtar, low Mstar)- Given its non-parametric nature as 
well as its powerful predictive and feature selection capabilities, machine learning offers an interesting 
alternative for creating mock catalogs. 

Subject headings: methods:numerical, galaxies: halos, cosmology:large-scale structure of universe 



1. INTRODUCTION 

As we enter the era of large-scale structure experi- 
ments such as LSST, WFIRST and Euclid, the creation 
of reliable mock galaxy catalogs will become increas- 
ingly more important. Such catalogs are essential for 
correctly characterizing the expected errors in the anal- 
yses of these datasets, calibrating analysis pipelines and 
ultimately measuring cosmological parameters (such as 
the dark e nergy equation of stat e) from galaxy cluster- 
ing (e.g. [Anderson et al.l I2012D . Making mock cata- 
logs for different subpopulations of galaxies (e.g. blue 
versus red, high Mstar versus low Mstar, etc.) to 
study their clustering properties is also of utmost im- 
portance f or understanding galaxy forinatio n and evolu- 
tion (e.g. ICoil et al.l[2008[ IGuo et al.l[20T2l) . Although 
these mock catalogs can be generated relatively quickly 
using pertur bation theory-b ased approaches such as that 
described in iManera et al.l p013.) , it is well known that 
these approximations break down at small scales (e.g. 
iCarlson. White fc Padmanabhanll2009l ). 

Alternatively, mock catalogs can be created using sim- 
ulations which capture non-linear structure growth to 
smaller scales, and inherently include redshift-spacc dis- 
tortions (velocity information of the dark matter parti- 
cles is known). Ideally we would like large- volume cos- 
mological simulations with both N-body and hydrody- 
namics, however, such simulations are computationally 
expensive. Hence, the large number of mocks neces- 
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sary for obtaining robust error measurements renders 
this approach impractical. Fortunately, since pure N- 
body simulations are relatively inexpensive, the litera- 
ture has been rife with methods for populating the dark 
matter halos found in these simulations with galaxies. 
One of the most popular methods for doing this is us- 
ing halo occupation distributions (HODs) which is an 
analytic model for determining the number of galaxies 
(Ngai ) that should form in a halo given its properties 
(e.g. iZheng et al.l 120091 ). More recently, subhalo abun- 
dance matching (SHAM) has beconi e very prevalent (e.g . 
IConrov. Wechsler fc KravTswl [20071: IVimal et al.l [20101) . 
This method relies on being able to correctly identify 
the subhalos within a halo, a very difficult problem 
in its own right due to resolut ion limitations (see e.g. 
iBehroozi. Wechsler &: Wull2011| ). It then assumes that 
each subhalo contains a single galaxy with a stellar mass 
or luminosity that is monotonically related to a subhalo 
property. 

While both methods have been shown to produce 
mock galaxy catalogs that match observations reason- 
ably well, it would be progressive to attempt to elim- 
inate the above mentioned assumptions. The sophisti- 
cated non-parametric regression algorithms that form a 
subcategory of "machine learning" (ML) are ideal for 
this purpose. To obtain the mapping from halos to Ngai, 
the only assumptions that these model-independent algo- 
rithms require are that such a relationship exists and that 
it is a continuous function of the halo parameters. They 
then proceed to construct a model from the data itself 
and hence do not impose any pre-supposed relationships 
onto the data. We note that although ML algorithms 
arc non-parametric in the sense that we do not need to 
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assume a known relationship between data parameters, 
they do often require us to assume some operational pa- 
rameters such as how severely poor predictions are pe- 
nalized. These, however, can be optimized as described 
in 

In addition, the ML-based approaches we propose here, 
like HOD-based methods, rely only on the properties of 
the parent dark matter halo. Hence, we can circumvent 
the difhculties in subhalo identification. Another point 
worth highlighting is that in principle, these techniques 
can be trivially extended to understand how halo proper- 
ties map onto different subpopulations of galaxies. This 
provides a method for making mock catalogs for these 
different subpopulations as well. 

ML algorithms do, however, need to be trained on 
large, accurate datasets in order for them to learn robust 
mappings between halo properties and galaxy properties 
such as Ngai : a large- volume N-body plus hydro simula- 
tion with reliable galaxy formation would be ideal. At 
present, we have not been able to acquire such a simula- 
tion and so we use the Millennium simulation with semi- 
analytic galaxy formation for this study. However, it is 
conceivable that such a simulation will become available 
in the near future which can be used to characterize the 
halo-galaxy mapping via the ML techniques discussed in 
this work. 

A final point of interest rests in the observation that 
most HOD methods typically use the halo mass as the 
only parameter in their halo-to-galaxy mappings. An im- 
portant topic to study is the sensitivity of Ngai to other 
halo parameters. For example, there have been investi- 
gations hinting that the environment of the halo is also 
an important factor in determ i ning how rnany g alaxies 
will form (e.g. IGao et al.l[20?)5l : ICroft et al.ll20m . This 
information can be gleaned using ML techniques as well, 
through performing a "feature selection" which picks out 
the halo properties that best predict Ngai- 

In § [2] we describe the dataset we use derived from 
the Millennium simulations. In §|3]we describe the 2 ML 
techniques we employ to learn the mapping between halo 
properties and Ngai- In §|l]we describe our results, i.e. 
how well our predictions match the actual values from 
Millennium. This section also includes a discussion on 
using ML to make mocks for different subpopulations of 
galaxies. We conclude in § [S] 

2. DATASET 

We construct our dataset from halo and galaxy cat- 
alogs at z = der i ved from the Millen nium simula- 
tion (jSpringell [20051: iSpringel et al.l I2005D . These cat- 
alogs are obt ained via querying the Millennium on- 
line database (jLemson et al.llToOGi ). The halo catalogs 
were generated using a friends-of-friends (FoF) algo- 
rithm with linking length of 0.2 and the semi-analytic 
galaxy pre scription used to po p ulate these halo s is de - 
scribed in iCroton et all j2006D : iDe Lucia et"all (|2006D : 
iDe Lucia fc BlaizotI ( 2007 ). The Millennium simulation 
is run with 2160^ particles in a 500ft.~^Mpc box. The cos- 
mology employed has ilm ~ 0.25, ilb = 0.045, Ha = 0.75, 
h = 0.73, Us = 1 and as — 0.9. 

To obtain our dataset, we search through the Millen- 
nium halo catalog and extract all primary halos (FoF 
groups) with mass greater than IO^'^H'^Mq (at present. 



we are unlikely to observe anything less massive except 
in the local Universe). We then match galaxies from the 
semi-analytic catalog to these halos, keeping only the pri- 
mary galaxies of a halo or subhalo (i.e. those flagged or 
1 in the Millennium database). Hence, we emerge with a 
halo catalog listing the following 7 parameters: number 
of particles in the halo Np, M200, velocity dispersion (t.„, 
maximum circular velocity Vmax, half-mass radius R1/2, 
spin and number of galaxies in the halo Ngai- Our goal 
is to train a machine learning algorithm to predict Ngai 
using the other 6 halo parameters. 

The semi-analytic model used to populate the Millen- 
nium halos with galaxies is dependent on various thresh- 
olds (such as for gas accretion and star formation) that 
are also evolved through time. This quality makes Mil- 
lennium an adequate testing ground for ML applications 
because the mapping from halo parameters to Ngai is 
much more complicated than the straight-forward func- 
tions normally employed in methods such as HOD and 
SHAM. The complexity of these mappings should be 
closer to the level we expect from actual N-body sim- 
ulations with hydrodynamics. 

There are 395,832 halos (with 445,983 total galaxies) 
in our sample which we use for our basic tests of the ML 
algorithms. However, since the Millennium semi-analytic 
model provides &, w, r, i, z magnitudes and stellar masses 
for the galaxies, we can also use these same halos to 
learn the mapping between halo properties and Ngai for 
different subpopulations of galaxies. We perform tests 
of the ML algorithms after splitting the halo sample on 
colour and stellar mass (a proxy for luminosity) in i|4.1l 
and respectively. 

3. MACHINE LEARNING ALGORITHMS 

We test 2 different machine learning (ML) algorithms 
for predicting Ngai from the halo parameters Np, A/200, 
o'v, Vmax, Ri/2 ^ud Spin. The first is a support vector 
machine (SVM) and the second is a k-nearest-neighbours 
(kNN) routine, both described below. They work by 
"learning" a relationship between a set of input features 
X and the value we're interested in predicting Y- SVM 
and kNN are both non-parametric in the sense that we 
do not need to assume a model as traditional methods 
for populating halos with galaxies do. The mapping 
is constructed using information in the data itself: the 
data picks the model best suited to it. In addition, we 
avoid the messy problem of subhalo finding which is a 
required step in any SHAM-based approach; like HOD- 
based models, our proposed ML techniques operate on 
the parent halo itself. 

The learning process is accomplished by "training" the 
algorithm on a set of training data where Y is known. 
The learned mapping can then be applied to a test set 
of X values with known Y to verify its accuracy. If the 
learned relationship appears robust, it can be applied to 
a set of X with unknown Y to make predictions. 

In testing the accuracy of the predicted values, we draw 
on the mean-squared-error (MSE) defined as 

T\/rOT^ ^^i—li^i-test.true ^i^test^predicted) , \ 

MbE = — (1) 

where is the number of test data points. This effec- 
tively measures a combination of variance (the scatter in 
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the predicted values) and bias (how different from truth 
the predicted values are). 

3.1. Support Vector Machines 

Support Vector Machines (SVMs) work by map- 
ping the features X to a higher dimensional-space 
and attempting to separate them into regions that 
map onto specific Y va lues using a set of hyperplanes 
(jCortes fc Vapniklll995[) . In its original form, it is a clas- 
sification scheme but can be generalized to a regression 
algorithm (Drucker et al. 1993). 

The general idea behind SVM is best illustrated 
through the case of a binary classifier. The binary SVM is 
trained on a set of input features X = {Xi,X2,X3...Xd}, 
where d is the total number of features, to classify data 
into one of two classes F = { — 1, 1}. Consider a set of N 
training data, each with a corresponding column vector 
of features X^ and a corresponding class li(Xi) = ±1. 
An SVM attempts to separate the training data into their 
appropriate classes using 2 parallel hyperplanes in a high- 
dimensional space. These planes can be written as 



W-X-fe = ±l 



(2) 



where W is the normal vector to the hyperplane and b 
is some constant scalar analogous to a y-intercept in 2D. 
The 2 planes must satisfy the condition that no points 
fall in between them, i.e. 



W • X, - 6 ^ 1 
for Xi of the first class (i.e. Y(X.i) = l) or 
W • X, - 5 ^ -1 



(3) 



(4) 



for Xi of the second class. Note that this can be simply 
re-written as 

y,(w-x, -6) ^ 1. (5) 

The region bounded by these 2 planes is known as the 
margin, which has width 

^ = P^- (6) 

The best classifier is obtained through maximizing this 
distance or minimizing ||W|| subject to the constraint in 
Equation ([5]) since in this limit we will obtain the most 
robust separation of the datapoints. For computational 
purposes, we actually end up minimizing i||W|p. The 
optimization can be performed using Lagrange multipli- 
ers (cki). This leads to minimizing ||W|| with respect to 
the Lagrangian 

1 ^ 

^("0 = 2l|W|r - ^a4y.(W • X. - 6) - 1] (7) 

i=i 

subject to the constraints > 0. Taking the deriva- 
tive of this Lagrangian with respect to ||W|| yields the 
solution 

N 

(8) 



|W|H^a,K,X, 



Values for the a, can be obtained by substituting this 
solution back into Equation (I7|) which yields 

N ^ N N 



and maximizing L{ai) with respect to the a;. It turns 
out that only a few of the Ui are non-zero. These cor- 
respond to the training points that satisfy the equality 
condition in Equation ([5]) . Such points are called the sup- 
port vectors and set the value of 6, i.e. (6 = W • X^ — F^). 
In practice, h is taken to be an average over the support 
vectors, that is 



, Nsv 



(10) 



i=l 



where Nsv is the number of support vectors. 

It is often useful to introduce some slack (quantified 
by below) into the SVM classifier. This amounts to 
replacing the constraint in Equation ([Sj with the new 
constraint 

y,(w.x, -5) ^ 1-e,. (11) 

The effective reduction of the margin width in the above 
equation allows for some misclassification of the data. 
We then minimize 



N 



^liwil- 



(12) 



where C is a parameter that determines the penalty for 
any misclassification. If we solve the Lagrangian for this 
case, we will find that = cti/C is required. 

In addition, one can see that Equation ([9]) contains 
the term X^Xj which is just the dot product between 
two vectors in feature space. We can then imag- 
ine generalizing this dot product to the dot product 
in a space s panned by a non-linear mapping ( $) of 
the features ([Aizerman. Brayerman fc Rozonoeii 119641 : 
iBoser. Guvon fc VapnikI I1992D . ThiT mapping can be 
used to take the features into a higher dimensional space, 
making them more easily separable. The dot prod- 
uct <I>(Xi)<I>(Xj) can be thought of as a kernel function 
fc(Xi,Xj). Commonly used kernels like the polynomial, 
Gaussian (or radial basis function, rbf ) and sigmoid func- 
tions are popular due to their simplicity. These have the 
forms 

fcpo;y(Xi, Xj) = (XiXj)™ (13) 
where m is the degree of the polynomial, 



/cr&/(Xj,Xj) = exp(-7||Xi 



and 



k sigmoid (X^ , Xj j 



tanh(Xi • Xj 



(14) 
(15) 



where 7 and r are parameters of the kernel. 

The simple binary SVM described abov e can be gener- 
alized to support vector regression (SVR) (jPrucker et al.l 
|1997[ ) which is the algorithm we employ in this study. For 
the regression problem, we seek hyperplanes satisfying 
the equations 



- (w • X, - 6) ^ e + 

(W.X,-6)-K,^e + C 



(16) 
(17) 



Here, e is a tolerance parameter, i.e. there is no penalty 
assigned to predictions that fall within e of the true value. 

> and > are slack variables corresponding to 
upper and lower constraints on the system output. 
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The quantity the SVR must minimize is 
1 ^ 

-iiw|p + c5^(e,+c)- (18) 

■i=i 

which closely resembles Equation (|12l) . The Lagrangian 
takes the form 

1 ^ 
L(a,,a*) = 2IIWIP + +er) (19) 

4=1 

N 

-^a,[e + e<-y4 + (W-X, + 6)] 
1=1 

-E"ne+er + y«-(w-x,+6)] 

1=1 

where ai and a^* arc Lagrange multipliers subject to 

the constraints < at < C, < a* < C and J2iLii^i ~ 
a*)=0. 

For our analysis, we use the SVR algorithm 
implemented in the scikit-learn Python library 
(jPedregosa et al.ir201lD . We use the default e = 0.1 and 
find that changing this value does not make a noticeable 
difference in our predictions or the resulting MSE. The 
algorithm takes in a value for the penalty parameter C 
from Equation (jlSp and the kernel, m, 7 and r need 
also be specified depending on what kernel is being 
used. Optimal parameter values can be determined by 
splitting our sample of halos into three equal parts: a 
training set, a validation set and a test set. We can 
then train the SVM using some pre-determined grid 
values: C = {100, 10^} and 7 or r = {0.1, IQ-^} in 
powers of 10, and calculate the MSE of the validation 
set. The parameter values that give the minimum MSE 
are chosen for our analyses, where we evaluate how well 
the ML predictions match truth using the test set. 

3.2. k-Nearest-Neighbours 

The k-Nearest-Neighbours (kNN) algorithm is much 
simpler than the SVM to understand and can also be 
used for both classification and regression. The kNN rou- 
tine calculates "distances" to the k nearest training data 
points for each point that we are interested in pre- 
dicting Yi. This distance is often just a simple Euclidean 
distance between the features X, however, one can imag- 
ine using other definitions as well. The predicted Yi is 
then just the average of the k nearest training set Y val- 
ues. This average can be weighted according to distance 
such that points further away have less impact on the 
predicted value. Mathematically, one can represent this 
as 

Y. = S.^(^(x..,x.))y. ^^^^ 

k 

where d(Xj,,X,j) is the distance between Xj and one of 
its k nearest neighbours X^., and w{d) is the weight cor- 
responding to that distance. 

We again use the sc ikit-learn implementation of kNN 
(jPedregosa et al.ll2011[ ). The algorithm takes in a value 
for k. Again, the optimal value can be found by step- 
ping through a pre-determined set k = {3, 6, 21, 24} 



and picking the value with the lowest MSE when the 
algorithm is applied to the validation set. 

3.3. Feature Selection 

Feature selection is the process by which we select 
which features are the most relevant for predicting Y 
from X. A simple approach to this is forward fea- 
ture selection which relies on an initial comparison to 
the base MSE, or the MSE calculated by taking all 
Yi^test,predicted = {Ytrain) in Equation (H]), i.e. 

Base MSE = T.tl(X^-test,true - {Ytra^n)f 

N ^ ' 

This is just the MSE one would obtain by doing the most 
naive thing: predicting all Y values to be the mean Y of 
the training set (denoted as (Ytrain) above. 

Forward feature selection starts by training an ML al- 
gorithm to predict Y using only a single feature. We 
repeat this for each individual feature and calculate its 
MSE value from a test set. If the minimum MSE is less 
than the base MSE, then we are doing better than the 
naive prediction, indicating that the features do contain 
information that is correlated with and can help us pre- 
dict Y . We then "select" the feature that produced the 
minimum MSE and individually add each other feature 
to it and repeat the train/test procedure to calculate 
MSE values. At the end of this round, if the minimum 
MSE is smaller than the minimum MSE in the previous 
step, we again select the feature that produced this min- 
imum MSE and repeat the previous procedure. At each 
step if adding in an additional feature decreases the min- 
imum MSE further, we continue. Otherwise wc stop and 
deem the remaining features as not having much predic- 
tive power beyond the "selected" features. 

Such feature selection schemes are useful for identifying 
the halo parameters that are most relevant to inferring 

Ngal. 

4. RESULTS 

As described in § [21 our tests use a sample of 395,832 
halos from the Millennium simulation to assess the ma- 
chine learning algorithms detailed in the preceding sec- 
tion. We randomly split this halo sample into three equal 
parts and use the first part for training, the second part 
for validation and the third part for testing. The base 
MSE of the test set is 0.505. We first look at the re- 
sults obtained through training the ML algorithms on all 
6 halo features. 

As mentioned in ^3.1[ we do a grid-search to find the 
SVM training parameters (C, 7 and kernel) that return 
the minimum MSE on the validation set. This procedure 
selected the rbf kernel with C = 1000 and 7 = 0.0001. 
We then use the SVM trained using these values to make 
predictions from the test set. For kNN, we use a similar 
search technique (described in i|3.2p and find that using 
fc = 12 gives the minimum MSE on the validation set. 
The kNN test set results below are derived using this 
value. 

A set of 2D histograms in Ngai.tme versus Ngai. predicted 
from our test set is shown in Figure [T] The top panel 
shows the SVM result and the bottom panel shows the 
kNN result. One can see that in both cases, the MSE 
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Fig. 1. — 2D histograms of machine-learning-predicted number 
of galaxies per halo (Ngai) versus the true number. Here we have 
taken the 395,832 halos with M > IO^^/i-^Mq from the Millen- 
nium simulation and split them randomly and equally into train- 
ing, validation and test sets. We then use the following features to 
predict the mapping from halo properties to Ng^i- number of par- 
ticles Np, Af200, CtJ, maximum circular velocity Vmax, half-mass 
radius R1/2 ^^'^ halo spin. The "goodness" of the prediction is 
indicated by the MSE (see Equation [TJ which is essentially a com- 
bined measure of variance and bias. The dashed black line in each 
panel indicates the 1:1 lino, (top) Results from the support vector 
machine (SVM) algorithm, (bottom) Results from the k-ncarcst- 
neighbours (kNN) method. One can see that the MSE is fairly 
small in both cases. For comparison, the base MSE (the MSE one 
would obtain if one always predicted Ng^i to be the average of the 
training set) is ~ 0.505, which is a factor of a few larger than the 
MSE values derived from our ML-predicted Ng^i- There appears 
to be a small bias towards under-prediction of Ng^i , however this 
does not seem to detract significantly from our ability to create 
large-scale structure mocks. 



is dramatically improved over the base MSE which indi- 
cates that the ML algorithms are learning some informa- 
tion about Ngai from the input features as expected. The 
MSE values from the 2 different methods are very similar 
and hence SVM and kNN appear to be equally good for 
inferring Ngai from halo features. However, we note that 
upon careful inspection of the 2D histograms, one sees 
that there is a slight bias towards under-predicting Ngai 
which is discussed more below. 

ML algorithms appear to match the expected distribu- 
tion of Nfiaio as a function of Ngai quite well as shown in 
Figure [2j The panels show histograms where the y-axes 
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Fig. 2. — Distribution of training, true and predicted test Ngai 
per halo from the Millennium simulation, (top) SVM results, (bot- 
tom) kNN results. One can see that the overall distribution of 
galaxies predicted using the ML algorithms track the true values 
well. It appears that we slightly overpredict the number of halos 
with Ngai = 1 a-nd underpredict elsewhere, however, this does not 
seem to significantly affect our ability to make mock catalogs for 
large-scale structure analyses. 



correspond to the fraction of halos with Ngai and the 
a;-axes correspond to Ngai- The top panel was obtained 
using the SVM method and the bottom panel shows the 
analogue for kNN. We slightly overpredict the number of 
halos with 1 galaxy, and underpredict elsewhere, espe- 
cially when using the SVM. The true number of galaxies 
in the test set is 149,064; for SVM we predict 140,519 
and for kNN we predict 144,346. This phenomenon was 
pointed out previously and will be elaborated on below. 

The galaxy correlation function ^(r) can also be used 
to test the robustness of the ML results. This test is es- 
pecially interesting as ^(r) is the principal observable for 
large-scale structure analyses, the key motivation behind 
constructing mock galaxy catalogs. We create a mock 
galaxy catalog using the Ngai values predicted by our 
ML algorithms. We place these galaxies randomly within 
their host halos according to an NFW profile in the ra- 
dial direction and random point generation on a sphere 
in the angular directions. We calculate the correlation 
function for the training, true test set and predicted test 
set galaxies in 5/i~^Mpc bins in the range 5/i~^Mpc- 
60/i~^Mpc (suitable for redshift-spacc distortion analyses 
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from large-scale structure) . Figure [3] shows the resulting 
^(r) using SVM (top) and kNN (bottom). Each plot 
has 2 panels, the upper panel shows the actual correla- 
tion functions and the bottom panel shows the percent 
difference (i.e. 100 x [£,predict{r) - £.true{r)]/£.truelr)) be- 
tween the correlation function calculated from the pre- 
dicted galaxies in the test set versus that calculated from 
the true galaxies in the test set. 

In the SVM case we see that at small scales (^ 
5/i^^Mpc), the predicted correlation function has a lower 
amplitude than the true correlation function by a signif- 
icant amount (~ 40%). For kNN, the difference is fairly 
benign (~ 10%). At these small scales, the 1-halo term 
is still important and hence the fact that we underpre- 
dict the number of halos with Ngai > 1 as shown in 
Figure [5] will result in lower clustering amplitudes than 
expected. However, for analyses of large-scale structure, 
we are more interested in ^(r) at r > 30/i~^Mpc where 
our predicted ^(r) is only ^ 5 — 10% lower than the true 
correlation function for the most part. As the amount of 
underprediction is fairly constant, one can even imagine 
implementing a simple scaling correction for crude ap- 
plications. Hence, the potential for making large-scale 
structure mocks using SVM or kNN remains worthy of 
further investigation. 

As mentioned above, there is a slight bias towards 
under-predicting Ngai- This is likely due to the fact that 
the number of halos with small Ngai dominates the over- 
all distribution while halos with high Ngai are much rarer 
(see Figure [2]). Then, taking kNN as an example, even if 
a halo should have a large number of galaxies, its near- 
est neighbours may still be dominated by Ngai — 1 halos, 
which will lead the algorithm to underpredict. The cur- 
rent training set is clearly incomplete for halos with large 
N. 



gat- 



A larger training sample will have a proportion- 
ately larger number of these halos so using a larger train- 
ing set should be able to partially mitigate this problem. 
In addition, recall that the best kernel and parameters 
(C and 7) for the SVM and number of nearest neighbours 
k for the kNN are selected using the validation set on the 
basis of minimizing MSE. Since the MSE is a balanced 
measure of variance and bias, one can imagine giving a 
different weighting to the bias, i.e. penalizing the MSE 
more if the bias is high. This can potentially reduce the 
amount of underprediction we see, however, we will pay 
the price of having a larger scatter in our predictions. 

For SVM we underpredict the total number of galaxies 
by ~ 6% and for kNN we underpredict by ~ 3%. These 
are both small and do not appear to significantly alter the 
correlation function shown in Figure [3] at scales relevant 
to large-scale structure analyses. At these scales, the cor- 
relation function is dominated by the 2-halo term which 
comes mostly from halos containing a single galaxy. As 
discussed above, such halos are the most abundant by 
far. In addition. Figure [2] indicates that the number of 
halos we predict with Ngai ~ 1 matches the true distri- 
bution reasonably well. Hence, it is not surprising that 
our predicted ^(r) is in fair agreement with the true ^(r) 
at large r. 

We can also look at the distribution of Ngai as a func- 
tion of the various features. Figure|4]shows this in scatter 
plot form for the kNN test. The analogous plots for SVM 
are largely the same. Here we have randomly subsampled 



5~ 



U 

c 

<D 
i— 



CD 
U 

c 

<v 

i— 
QJ 



60 

40 

20 




■20 

■40, 

60 
40 
20 



r . 

^ ^ u 

< 


SVM_ 

1 


■ Train 

A Test True 

X Test Predicted 


■ si! 









30 

-1 



60 



r{h 'Mpc) 



kNN 



■ Train 

Test True 
X Test Predicted 



■ i 




30 

r(/i"^ Mpc) 

Fig. 3. — Correlation functions g(r) of the training, true test set 
and predicted test set galaxies. We create a mock galaxy catalog 
from the ML-predicted galaxies in order to calculate their ^{r). 
This is an excellent test of the ML predictions as g(r) is a key 
observable used for large-scale structure analysis, a major moti- 
vator for generating mock galaxy catalogs in the first place. The 
bottom panels of each plot show the percentage difference between 
the 5(r) calculated from the actual test set galaxies and the ML- 
predictions. The grey dashed line marks 0% to help guide the 
eye. The correlation functions of the ML-predicted galaxies match 
the true correlation function well overall. However in the SVM 
case, due to the underprediction of halos with Ng^i > 1 at small 
scales (around 5/i^^Mpc) where the 1-halo term is still important, 
the clustering amplitude appears to be 40% smaller than expected. 
This is not a major deterrent from using ML for making large-scale 
structure mocks since for these analyses we are mostly interested 
in scales greater than ^ 30/i~^Mpc where the difference between 
true and predicted ^{r) is mostly < 10%. In addition, the kNN 
case demonstrates better agreement between predicted and true 
5(r) at small scales. Hence, ML provides a potentially interesting 
alternative for creating large-scale structure mocks. 
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the number of points plotted to 3,000. The black points 
show true Ngai versus features while the red points show 
predicted Ngai ■ One can see that overall the span of the 
points overlaps quite well between the true and predicted 
sets, again indicating the general statistical agreement 
between the two. However, the black points do show 
slightly more spread overall. This indicates that the halo 
properties we have chosen to use here may not fully en- 
capsulate the mapping to Ngai, i-e. we are still missing 
some information that is relevant to the halo-galaxy re- 
lationship. This should not be surprising as most of our 
parameters {Np, A/200 1 ^-y Vmax) are effectively mass trac- 
ers. The ratio of R1/2 and i?200 (calculated easily from 
M200) can be thought of as a measure of concentration, 
effectively the ratio of 2 radii that are defined in the same 
way for all halos. Concentratio n is known to trace mass 

avarro. Frenk fc Whitel[T996( ) . bu t has also been found 
to correlate with ha l o environment (IWechsler et al1l2006l : 
IGao fc White! 120071 : IMaccio et al\\m)W 7 One can imag- 
ine that in addition to mass, spin and environment, there 
are additional factors (branching from the full merger 
history of the halo) that might affect Ngai ■ Fortunately, 
it appears that any other factors are not completely or- 
thogonal to the halo properties used in this study. As 
shown above, our parameters seem to capture most of 
the halo-galaxy mapping, at least in the context of the 
Millennium simulations. 

Finally, wc can perform a feature selection to identify, 
within the framework of Millennium, the most predictive 
halo property for Ngai ■ We employ a forward feature se- 
lection algorithm using SVM to do this. To ensure the 
stability of our results, we re-perform the feature selec- 
tion 10 times with different random draws of the training, 
validation and test sets. Key numbers are summarized 
in Table [T] The MSE values quoted for each feature un- 
der the column heading "first round" correspond to the 
median MSE values of the 10 trials in the first round of 
the forward feature selection. One can see that R1/2 has 
the smallest MSE in the first round of selection which in- 
dicates that it is the best predictor for Ngai- Using R1/2 
as the only feature for training gives a median MSE of 
0.163 which is very close to the MSE obtained using all 
features as shown in Figure [1] The other parameters 
yield MSE values ranging from 0.189 (M200) to 0.275 
(spin). The values quoted under "second round" corre- 
spond to the median MSE obtained by adding in another 
feature on top of Ri/2- One can see that this does not 
significantly change/improve on the minimum MSE from 
the first round. Hence it appears that most of our con- 
straint on Ngai is coming from R1/2 in the Millennium 
simulations. The selection of R1/2 should not be surpris- 
ing. It contains information about the halo mass and, 
as discussed above, can be related to halo environment 
through M200. 

4.1. Colour- dependent Mocks 

The Millennium semi-analytic galaxies come with 
b,v,r,i, z absolute magnitudes which we can use to define 
colours and split galaxies into blue and red subpopula- 
tions. This allows us to study whether or not ML-based 
methods for learning the halo-galaxy mapping and, most 
importantly, making mock catalogs can be directly ex- 
tended to subpopulations of galaxies that have different 



TABLE 1 

MSE VALUES OBTAINED BY PERFORMING A FORWARD FEATURE 
SELECTION USING SVM. THE VALUES QUOTED ARE THE MEDIAN 
MSE FROM 10 DIFFERENT RANDOMIZATIONS OF THE TRAINING, 
VALIDATION AND TEST SETS. THE HALO PARAMETERS WE USE ARE 
LISTED IN COLUMN 1 AND THE MEDIAN MSE VALUES OBTAINED BY 
USING ONLY THE LISTED PARAMETER ARE SHOWN IN COLUMN 2 

(First Round). One can see that R1/2 has the smallest 

MEDIAN MSE and hence IT SHOULD BE THE BEST PREDICTOR OF 
Ngai IN THE CONTEXT OF THE MILLENNIUM SIMULATIONS. ADDING 

IN ADDITIONAL PARAMETERS DOES NOT SIGNIFICANTLY CHANGE 
THE MEDIAN MSE AS INDICATED IN THE THIRD COLUMN (SECOND 

Round) which lists the median MSE values obtained using 

Rl/2 PLUS THE PARAMETER LISTED IN COLUMN 1. 



Parameter 


First Round 


Second Round 


Np 


0.216 


0.170 


M200 


0.189 


0.163 




0.220 


0.163 


^max 


0.229 


0.163 


Rl/2 


0.163 




spin 


0.275 


0.167 



colours. We define blue galaxies to have {v — r) < 0.7 
and red galaxies to have (w — r) > 0.7. This gives 175,177 
blue galaxies and 270,792 red galaxies. 

Wc again split the 395,832 halos equally into training, 
validation and test sets for each case (blue or red) and 
repeat the previous tests using kNN. Note that many of 
our halos now have red or blue galaxies reducing the 
size of our effective training sample. In the case of the 
blue galaxies, the MSE we obtain after applying kNN is 
0.186 as compared to a base MSE of 0.334. For the red 
galaxies, we obtain an MSE of 0.293 as compared to a 
base MSE of 0.738. In both cases, there is a significant 
reduction in MSE compared to the base MSE which in- 
dicates that the algorithm is learning information about 
Ngai from our input features. 

Correlation functions derived from our predictions are 
shown in Figure O One can see that the predicted ^(r) 
agrees fairly well with truth. Again at small scales ^ 
5h~^Mpc, we see that the predicted ^(r) has a lower clus- 
tering amplitude, especially for the red galaxies (~ 20%). 
If wc look at the distribution of halos with Ngai , we again 
sec that there is a small undcr-prcdiction in the num- 
ber of halos with Ngai > 1 that can cause this effect. 
This is slightly more problematic here as the main goal 
of studying the dependence of clustering on colour is to 
understand galaxy formation and evolution mechanisms 
which rely on information at small scales. However, such 
problems can be mitigated if wc had a larger set of train- 
ing data. We are now also approaching the point where 
the number of galaxies observed is large enough for us 
to begin understanding the differential clustering of blue 
and red galaxies at large scales. The agreement between 
the correlation function derived from our ML-predicted 
galaxies and the true correlation function at these scales 
is better (< 10%) and hence ML-based mock catalogs 
can be useful for these purposes. 

4.2. Stellar Mass-dependent Mocks 

Since the Millennium semi-analytic galaxy models also 
supply us with a stellar mass, we can split our galaxies 
into high stellar mass Mgtar and low stellar mass sam- 
ples. This can help us understand how effectively we can 
extend ML-based approaches to understanding the halo- 
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Fig. 4. — Distribution of Ng^i as a function of the different features. The black datapoints correspond to the true Ng^i values from 
the test set and the red datapoints correspond to the ML-predicted values. One can see that the true and predicted points have similar 
spreads, however, the true points tend to he slightly more spread out overall. This suggests that the features we have used to predict the 
mapping between halo properties and Ng^i do not fully capture this relationship, although they do well for the most part. We should 
not be surprised though as our features mostly trace halo mass and environment. In principle, the full merger history of the halos might 
impact more than mass and environment. The fact that these 2 key factors already predict the mapping to Ng^i as well as shown here is 
impressive. 
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Fig. 5. — Correlation functions of the training, true test set and 
predicted test set galaxies. Here, we have separated the galaxies 
into blue and red subpopulations. One can see that the general 
agreement between predicted and true §(r) is not bad with de- 
viations mostly < 10%. However, there appears to be a slight 
deficiency in power at small scales, especially for the red galaxies, 
in the predicted 5{r). This will make the study of galaxy evolu- 
tion more problematic since it draws on information contained in 
the small-scale clustering of galaxies. However, using ML to make 
color-dependent mocks at large scales remains a possibility. 



galaxy mapping as a function of stellar mass, including 
their potential for constructing mock catalogs for these 
distinct subpopulations of galaxies. We make the cut at 
IO^^Mq/H which gives 71,573 high Mgtar galaxies and 
374,410 low Mstar galaxies. Stellar mass and luminosity 
are correlated with each other and hence by performing 
this split we are effectively separating our galaxies into 
low and high luminosity samples. 

Once again we split the halos randomly and equally 
into training, validation and test sets. We then run kNN 
to predict Ngai for each of the high and low stellar mass 
cases. Note that in the high Mgtar case, most of our halos 
now contain galaxies so we have reduced the effective 
size of the training set by a large amount here. For the 



high mass case, we obtain an MSE of 0.119 after apply- 
ing kNN as compared to a base MSE of 0.250. For the 
low mass case, our MSE is 0.214 as compared to a base 
MSE of 0.283. Again, the MSE after putting the data 
through a kNN algorithm is smaller than the base MSE, 
indicating that the algorithm is learning about Ngai from 
the input features. 

A plot of the correlation functions derived from the 
ML-predicted galaxies is shown in Figure ID The low 
Mstar £,{t) agrees well with truth at large scales (< 5% 
deviation), however, it is again low near 5ft.~^Mpc. Like 
in the above case where we split by colour, this is slightly 
troublesome since studying the luminosity dependence 
of clustering is also mostly focused on understanding 
galaxy evolution through the clustering at small scales. 
Nonetheless, mocks produced using our predicted Ngai 
can still benefit studies of luminosity-dependent cluster- 
ing at large scales. The agreement in the high Mstar case 
is poor with deviations > 20%. However, this is because 
there are very few high stellar mass galaxies. With only a 
small number of halos with non-zero Ngai ■, it is not sur- 
prising that the algorithm has some difficulty with the 
regression. Looking at the overall distribution of halos 
with N gal in this case reveals an underprediction of halos 
with Ngai > (including Ngai = !)• Having fewer galax- 
ies that break the Ngai = 1 threshold effectively increases 
the galaxy bias and hence the clustering amplitude due 
to the 2-halo term which begins to become important 
near ^ 5ft-~^Mpc and is dominant at larger scales. 

5. CONCLUSIONS 

We have made some preliminary investigations into us- 
ing machine learning techniques to populate dark mat- 
ter halos from N-body simulations with galaxies. Since 
it is very computationally expensive to run cosmologi- 
cal N-body simulations with hydrodynamics, and per- 
turbation theory approaches tend to have problems on 
small-scales (such as treating redshift-space distortions 
correctly), machine learning serves as a powerful alter- 
native for creating large numbers of mock galaxy cata- 
logs. These are a key ingredient in large-scale structure 
analyses, a quickly emerging area with the advent of large 
galaxy surveys such as LSST, WFIRST and Euclid which 
will have effective survey volumes of ^ 10— 100/i^'^Gpc'^. 

Most importantly, machine learning brackets a sub- 
class of non-parametric algorithms which provide a 
unique way to construct the mapping from halo prop- 
erties to galaxies. Unlike other techniques such as HOD, 
we do not need to pre-suppose a known model for the 
halo-galaxy relationship. The only assumption that must 
be made is that a function taking halo properties to num- 
ber of galaxies per halo does exist and that it is smooth. 
It also allows us to circumvent the problems in subhalo 
identification which affect SHAM-based approaches. 

We test 2 machine learning algorithms, support vec- 
tor machines and k-nearest-neighbours, on the halos and 
semi-analytic galaxies in the Millennium simulation. We 
use 6 halo properties: number of particles Np, M200, c^v, 
maximum circular velocity Vmax, half-mass radius R1/2 
and halo spin, to characterize the mapping between halo 
properties and Ngai (the number of galaxies that reside in 
the halo). We find that both ML algorithms give mean- 
squared-errors of ~ 0.16 for the predicted Ngai, which 
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Fig. 6. — Correlation functions of the training, true test set and 
predicted test set galaxies. Here we have split the galaxies into 
high and low stellar mass samples. One can see that in the high 
Mstar case, the agreement between predicted and true ^(r) is poor 
with > 20% differences at most scales. However, this is due to the 
fact that high stellar mass galaxies are very rare. Most halos do 
not contain any high Mstar galaxies which reduces the effective 
size of our training sample by a large amount. In the low Mstar 
case, the predicted and true i,{r) agree to ~ 5% at most scales. 
Again, the slight underprediction of small scale power will make 
galaxy evolution studies difficult. However, the good agreement at 
large scales suggests that ML-based approaches for making mock 
catalogs have potential at these scales. 



is much smaller than the base MSE 0.505. The overall 
distribution of number of halos as a function of Ngai is 
also matched well by the ML predictions. We use our 
ML-predicted galaxies to create a mock galaxy catalog 
and calculate the correlation function from it. While in 
the SVM case we see a deficiency in clustering amplitude 
at small scales (~ 5/i~^Mpc) by ^ 40%, the predicted 
correlation function tracks that calculated from the true 
Millennium galaxies to < 10% at the scales most relevant 
to large-scale structure analyses. This is very important 
as large-scale structure science is the key motivator be- 



hind the construction of mock galaxy catalogs. 

Due to the rarity of halos with high Ngau we do find 
that there is a slight bias towards under-predicting Ngai ■ 
This can lead to the minor deficit in power at small 
scales in ^(r) as mentioned above, however, this docs not 
appear to significantly deter our ability to make mock 
catalogs. As previously stated, we obtain a reasonably 
good matching between the predicted and test correla- 
tion functions at scales relevant for large-scale structure 
studies. 

We see that the predicted and true Ngai values as a 
function of the features are similar in spread. However, 
for a given Ngau the spread in the features is slightly 
larger for the true values. This suggests that the features 
we have used here do not fully capture the mapping be- 
tween halo properties and Ngai, although, they do come 
very close. One should not be surprised by this since 
our features mostly trace halo mass and environment. 
While key, the full merger history of the halos is likely 
to impact properties of the halo beyond just mass and 
environment. 

We also demonstrate a simple feature selection proce- 
dure on our halo properties. Feature selection is merely 
the process by which we use machine learning to iden- 
tify the halo property most predictive in the mapping to 
Ngai- In the context of the Millennium simulations, our 
feature selection algorithm identifies R1/2 as the most 
relevant parameter. This is not terribly surprising as 
Ri/2 is germane to both halo mass and environment. 

Finally we investigate direct extensions of our ML 
algorithms to understanding the halo-galaxy mapping 
and making mock catalogs for various subpopulations 
of galaxies (i.e. blue, red, low Mstar and high Mstar)- 
Wc find that in general the agreement between the pre- 
dicted and true correlation functions is fair. However, 
once again we observe an underprediction of power at 
small scales in most cases. As studies of differential 
clustering in various subpopulations are aimed at un- 
derstanding galaxy formation and evolution which draw 
on information contained in the small scales of the cor- 
relation function, this is slightly more problematic here. 
However, we are entering an era where the number of 
observed galaxies is large enough to engage in studies of 
differential subpopulation clustering at large scales. Our 
ML-predicted ^(r) matches truth to ~ 5 — 10% at these 
scales and hence provides an interesting alternative for 
creating mocks for such studies. 

The key advantage of ML is that it offers a method 
of inferring the halo-galaxy mapping in a model- 
independent manner. In addition, it is computationally 
inexpensive: for example, to train an SVM on 170, 000 
points as done in this study takes < 1 hour on a single 
core. If we can run a large cosmological simulation (N- 
body plus hydrodynamics) with galaxy formation cali- 
brated against available observations, we should be able 
to use the ML algorithms tested here to learn the map- 
ping from halo properties to Ngai very quickly. We can 
then create large sets of mock galaxy catalogs from pure 
N-body simulations which are much less computationally 
expensive than running a large number of these fully- 
armed cosmological simulations. 

There also exist a number of avenues for future in- 
vestigation including the use of ML-based approaches to 
predict not only Ngai but also the positions and veloci- 
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ties of galaxies within the parent halo. This is much more 
complex as it requires predicting a distribution (i.e. mul- 
tiple galaxy positions (x,?/, z) and velocities (wx,Wy,^^z)) 
for each halo. One can also imagine devising methods to 
mitigate the bias towards underprediction (i.e. penaliz- 
ing the MSE more for biased predictions or giving more 
weight to high Ngai neighbours in a kNN implementa- 
tion). These will all aid in our quest to generate mock 
catalogs reliably and efficiently. 
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